Abstract—The basic question of how to optimally make use of a finite number of available samples in designing 
pattern recognition systems is considered. This has several components: optimal use of the samples for design and 
testing; and the relationship between the number of measurements and the number of samples for various prob- 
ability structural constraints. A spectrum of possibilities has been demonstrated, placing several apparently 
conflicting recent results in perspective. 


I. INTRODUCTION 


SOME questions on dimensionality and sample size which arise in the statistical approach to 
the design of pattern classification systems are: what is the best way to use a fixed size 
sample to design a classification system and evaluate its performance? When a certain finite 
number of samples is available what should be the dimensionality of the pattern vector, i.e. 
how many variables should be used, and if one can get as many samples as one wants, can 
the probability of error be made arbitrarily small by increasing the number of variables? 

Surprising as it may seem now, in much of the earlier work in pattern classification, 
especially that based on adaptive algorithms, the entire set of available samples was first 
used for design and future performance was then predicted to be that achieved on this 
design set. By now it is well known that this procedure is biased, resulting in too optimistic 
an estimate of performance. 

The choice between competing design procedures can only be based on predicted 
performance. We would like the ranking of procedures based on performance estimated 
from a fixed size sample to correspond to the ranking that would occur given actual per- 
formance. Moreover we want the estimated performance of the system finally selected to be 
a "good" predictor of its actual performance. Both the sizes of the design and test sample 
sets influence the accuracy of these estimates. We are then faced with the problem of the 
optimum use of a fixed size sample for maximizing the accuracy of these estimates. This 
can be considered without reference to the specific competing design procedures. 

Pattern vector dimensionality enters into the effectiveness of the design based on finite 
samples. In statistical classification, estimation, and prediction, it has often been noted that, 
with finite samples, performance does not always improve as the number of variables is 
arbitrarily increased. Sometimes it may even deteriorate. This, added to the increase in 
system complexity for more variables, makes the relationship of pattern vector dimensionality 
to design sample size worth investigating. 

Some answers to these questions have been previously proposed. In this paper we (1) 
examine the specific formulation and applicability of the proposed solutions and consider 
modifications which make the results more useful; (2) formulate a trade-off between the 
number of variables and the number of samples, and (3) consider the role of structural 
assumptions in questions concerning dimensionality and sample size. 


II. DESIGN AND TESTING 


Suppose we have alternate design procedures applicable to a problem and have a finite 
sample set. Eventually we want to choose the best design procedure and then use it to 
design a system with all available samples. But first we do the alternate designs using only 
part of the total sample set and then test their performance on the remainder. This method is 
hereafter referred to as the holdout or H method. 

For the selection of the best design procedure based on performance data we need to 
maximize our confidence in the test results obtained. On the other hand we would like the 
systems designed with a subset of the available samples to reflect faithfully the design 
based on the entire set. This leads to the problem of optimum use of a sample set to maxi- 
mize overall confidence in the design and testing of the system. 

To our knowledge, HIGHLEYMAN(1) is the only publication presenting analytical results 
for this problem. The paper suggests partitioning the total sample set into disjoint design 
and test sets and obtains the optimum partitioning to minimize the variance of the estimated 
error rate. The assumptions underlying the analysis are: the error rate, $e$, can be expressed 
as a function of a finite number of estimated parameters; it can be expanded into a Taylor 
series about the error rate, $e_0$, of the optimum system based on infinite design samples; 
and the deviations $(e - e_0)$ are small enough for terms, in the expansion, higher than second 
order to be neglected. The result of the analysis is a set of curves showing the fractions of the 
available sample set, which should be used for design and testing, versus a function of the 
ideal error, $e_0$, the total number of samples, t, and a quantity which measures the effectiveness 
of the design. 

The H method analysed in (1) and the recommendations given there are interesting. 
However, it has not been recognized that they are valid only when the total sample size is 
large. When a prescription is most needed, i.e. in the small sample case, the analysis breaks 
down because the deviations $(e - e_0)$ can no longer be considered small enough to be 
approximated by just two terms of the Taylor expansion. In this case the resulting recommendations 
are not applicable. Fortunately, approaches other than the H method are 
possible and do apply to the small sample case. They also give a better estimate of the error 
rate for the system designed with all the available samples, something for which there is no 
provision in the H method. 

We suggest the use of the following method, which is the most promising general procedure 
among the various procedures that have been experimentally evaluated in 
LACHENBRUCH(2). The problem considered in this paper is more general than the one posed 
in (1), but this only enhances the usefulness of the results. If $m$ is the total number of samples, 
take all possible partitions of size 1 for the test set and $m-1$ for the design set. This results in 
successively omitting one sample in the design procedure. The estimates of error obtained 
are unbiased estimates of the error rate for a design based on $m-1$ samples. Following (2), 
we term this the U method. 
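
The difference between testing on the design set and the U method can be sketched as follows. This is only an illustration under stated assumptions: the nearest-mean classifier on one-dimensional data and all function names are hypothetical choices made here, not the design procedures evaluated in (2).

```python
def nearest_mean_design(samples):
    """Design step: store the mean of each class; samples are (x, label) pairs."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))


def design_set_error(samples):
    """Design and test on the same samples (the optimistically biased procedure)."""
    means = nearest_mean_design(samples)
    wrong = sum(classify(means, x) != label for x, label in samples)
    return wrong / len(samples)


def u_method_error(samples):
    """Omit each sample in turn, design on the other m - 1, test on the omitted one."""
    wrong = 0
    for i, (x, label) in enumerate(samples):
        design_set = samples[:i] + samples[i + 1:]
        means = nearest_mean_design(design_set)
        wrong += classify(means, x) != label
    return wrong / len(samples)


# Hypothetical toy data: two classes on a line, with two samples near the boundary.
toy = [(0.0, 0), (4.0, 0), (5.0, 1), (9.0, 1)]
print(design_set_error(toy), u_method_error(toy))
```

On this toy set the design-set estimate reports no errors, while the U method misclassifies the two boundary samples once each is withheld from the design.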

The H method is quite uneconomical with the use of samples; the U method is definitely 
superior in this respect. However, it requires the computation of m designs for each candidate 
procedure. If the design calls for the inversion of a covariance matrix, the prospect of 
computing m inverses might appear formidable. Fortunately an identity by Bartlett leads 
to a method which requires only one explicit inversion for each class covariance matrix (only 
one for the whole problem if covariance matrices are assumed equal). The details are given in 
LACHENBRUCH.(3) 
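
A minimal sketch of how a rank-one inverse update avoids repeated inversions is given below. It assumes the identity in its usual form, $(A + uv^T)^{-1} = A^{-1} - A^{-1}uv^TA^{-1}/(1 + v^TA^{-1}u)$, and uses the fact that omitting sample $x_i$ changes the scatter matrix $S$ (the covariance matrix up to a scale factor) to $S - \frac{m}{m-1}dd^T$ with $d = x_i - \bar{x}$. The two-dimensional data and all function names are hypothetical, and the code is not taken from (3).

```python
def inverse_2x2(a):
    """Explicit inverse of a 2 x 2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def rank_one_inverse_update(a_inv, u, v):
    """Return (A + u v^T)^{-1} from A^{-1} without a new explicit inversion."""
    n = len(u)
    a_inv_u = [sum(a_inv[i][j] * u[j] for j in range(n)) for i in range(n)]
    vt_a_inv = [sum(v[i] * a_inv[i][j] for i in range(n)) for j in range(n)]
    denom = 1.0 + sum(v[i] * a_inv_u[i] for i in range(n))
    return [[a_inv[i][j] - a_inv_u[i] * vt_a_inv[j] / denom for j in range(n)]
            for i in range(n)]


def scatter(points):
    """Scatter matrix (sum of outer products of deviations) and mean of 2-D points."""
    m = len(points)
    mean = [sum(p[k] for p in points) / m for k in range(2)]
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s, mean


# Hypothetical sample of m = 5 two-dimensional points.
points = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0), (4.0, 2.0), (3.0, 5.0)]
m = len(points)
full_scatter, mean = scatter(points)
full_inverse = inverse_2x2(full_scatter)  # the single explicit inversion


def leave_one_out_inverse(i):
    """Inverse scatter matrix with sample i omitted, obtained by a rank-one update."""
    d = [points[i][0] - mean[0], points[i][1] - mean[1]]
    c = m / (m - 1)
    return rank_one_inverse_update(full_inverse, [-c * d[0], -c * d[1]], d)


print(leave_one_out_inverse(0))
```

Each of the m leave-one-out inverses then costs only a few vector products, which is what makes the U method practical for designs based on covariance matrices.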

Having considered the first question mentioned in the introduction we turn to the 
relationship between dimensionality and sample size. We consider first the ideal situation. 


III. INFINITE SAMPLE SIZE 


Optical Character Recognition is an example of situations in which the designer has, 
potentially, a large, essentially infinite sample. Models leading to very optimistic and rather 
pessimistic results have been presented for classification based on an infinite sample set. 
Representative of optimistic results are those presented in GAFFEY(4), ALBRECHT(5) and 
CHU(6). Reference (4), which appears to have gone unnoticed during the last decade, assumes 
multivariate Gaussian distributions. Its results are sharper than those presented in (6) 
which are more general since they do not invoke the Gaussian assumption. However, 
reference (6) assumes independent variables. Reference (5) evaluates bounds for the error 
rate for independent, Gaussian variables. These references derive conditions whereby the 
probability of error can be made arbitrarily close to zero as the number of variables in- 
creases. 

In the multivariate normal case (assuming equal covariance matrices for the two classes), 
if $\mu_k^{(i)}$ denotes the mean of the kth variate in class i, and $\sigma_k^2$ its variance, this result is obtained 
under the following conditions. If 

$$\Delta_k = \frac{\left|\mu_k^{(1)} - \mu_k^{(2)}\right|}{\sigma_k}$$

and $[r^{kj}]$ is the inverse of the correlation matrix, then 

$$\sum_{k=1}^{N} \sum_{j=1}^{N} r^{kj} \Delta_k \Delta_j$$

should diverge as N, the number of variables, is increased to infinity. This is a necessary and 
sufficient condition for the probability of error to approach zero with increase in the 
number of variables. If the variables are independent, then a sufficient condition is: $\Delta_k > \delta > 0$ 
for all k. An even weaker condition is given by the following. For $N > N_0$, 

$$\operatorname*{g.l.b.}_{1 \le k \le N} \Delta_k > \delta_1 > 0.$$

If the variables are not independent, a sufficient condition is: the lower bound $\delta$, $\Delta_k > \delta > 0$, 
exists for all k and that for $N = N_0$, 
If we do not assume normal distributions and let $f_1(x_k)$ and $f_2(x_k)$ be the probability density 
functions of the variable $x_k$ for classes 1 and 2 respectively, then defining a distance by 

$$d_k = \int_{x_k} \left| f_1(x_k) - f_2(x_k) \right| \, dx_k,$$

the condition that $\sum_{k=1}^{\infty} d_k$ diverges is sufficient for the probability of error to be arbitrarily 
close to zero when the number of variables is increased indefinitely. 

The implication of these conditions is that if each variable, being considered for inclusion 
in the decision function, contributes ever so little to the discrimination, by using sufficient 
numbers of them, we can eventually get perfect discrimination. A corollary is that, by adding 
another variable, even if we do not do any better, we can never do any worse. 
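
The effect can be illustrated numerically under simplifying assumptions that go beyond the text: two Gaussian classes with known parameters, equal priors, equal covariance matrices and independent variables, for which the Bayes error is $\Phi(-D/2)$ with $D^2 = \sum_k \Delta_k^2$ and $\Phi$ the standard normal distribution function. The function name and the value 0.2 are hypothetical.

```python
import math


def bayes_error(deltas):
    """Bayes error for two equal-prior Gaussian classes with independent variables.

    deltas[k] is the mean separation of variable k in units of its standard
    deviation; the error is Phi(-D / 2) with D^2 the sum of squared separations.
    """
    distance = math.sqrt(sum(d * d for d in deltas))
    return 0.5 * math.erfc(distance / (2.0 * math.sqrt(2.0)))


# Every variable contributes the same small separation of 0.2 standard deviations.
for n_vars in (1, 100, 10000):
    print(n_vars, bayes_error([0.2] * n_vars))
```

With known parameters the error falls steadily as weakly discriminating variables are added; nothing in this calculation reflects the cost of estimating the parameters from a finite sample.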

How much do these optimistic results help us in practice? First the assumption of 
independence for arbitrarily large numbers of observables is unrealistic. In most practical 
problems the observables represent band-limited functions such as two dimensional visual 
patterns and time functions. Even if we are prepared to process all the observables there are 
just so many independent ones available. Secondly, in practice it is even more difficult to 
determine if they are truly independent than if 

$$\sum_{k=1}^{\infty} d_k$$

diverges. Thirdly, these results do not give any idea of the danger involved in the statement "the more, 
the better" applied to the number of variables, when the number of samples available for 
estimating the decision function is not large. This last point is quite important in a number 
of contexts (e.g. HARLEY(7)). 

The extreme among pessimistic results is that presented in HUGHES.(8) This reference 
considers the problem of finding the average probability of error over all problems with a 
fixed number of measurement states. The number of measurement states can be computed 
as follows. If we have N variables and the kth variable can take on one of $r_k$ values, then the 
total number of measurement states is 

$$\prod_{k=1}^{N} r_k.$$


No assumption of statistical independence or dependence between the variables is made, 
nor is there a metric in the measurement space. The latter point, broadly speaking, means 
that no parametric families of distributions can be reasonably fitted on the probability 
functions in the measurement space. This statistical model is general enough and even, as we 
shall see, too general. In this model, in making the decision about class membership, we 
need the probabilities $P[S_i \mid \alpha]$ where $S_i$ are the measurement states and $\alpha$ the class index. 
When $S_i$ occurs, the class for which $P(\alpha)P(S_i \mid \alpha)$ is highest is decided upon as the class from 
which the pattern came. If we know these probabilities, or equivalently if we have an infinite 
number of samples from which to estimate them, then for two class problems with equal 
prior probabilities, reference (8) establishes that, on the average, even if we increase the 
number of measurement states (or the number of variables) indefinitely the probability of 
correct recognition only approaches 0.75. 
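
The limiting value can be checked with a small Monte Carlo sketch. It assumes that "the average over all problems" means drawing each class-conditional probability vector uniformly over all valid probability vectors on the measurement states; that reading of the model in (8), the function names, and the trial counts are assumptions made here for illustration.

```python
import random


def random_distribution(n_states, rng):
    """Draw a probability vector uniformly from the simplex over n_states states."""
    weights = [rng.expovariate(1.0) for _ in range(n_states)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_bayes_accuracy(n_states, trials, seed=0):
    """Average Bayes accuracy over random two-class problems with equal priors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        p1 = random_distribution(n_states, rng)
        p2 = random_distribution(n_states, rng)
        # In each state the Bayes rule picks the class with the larger P(a)P(S|a).
        total += sum(0.5 * max(a, b) for a, b in zip(p1, p2))
    return total / trials


for n_states in (1, 2, 10, 200):
    print(n_states, round(mean_bayes_accuracy(n_states, 500), 3))
```

With a single state the accuracy is 0.5, and as the number of states grows the average settles near 0.75 rather than approaching 1.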

This result takes into account that in some problems the variables may be independent, 
in some others correlated; in some the probability densities may be continuous. The figure 
values obtained are astonishingly pessimistic. For instance, when 500 samples are available 
from each class, the optimum measurement complexity is 23. For a vector of binary variables, 
this allows fewer than 5 variables. As shown in reference (10) some interesting variations 
on this behaviour occur for the case of unequal prior probabilities. The existence of an 
optimum measurement complexity is a likely characteristic of all statistical classification 
systems designed with a finite sample. 

Performance as a function of the dimensionality of the variables and the sample has 
been considered by ALLAIS.(11) Although it deals with the prediction problem assuming 
Gaussian statistics, the important similarities between prediction and pattern classification 
make the results worth discussing. 

The decision function considered in (11) is the maximum likelihood estimator of the 
ideal prediction function, based on known parameters, and derived from a minimum mean 
square error criterion. Assuming an N-dimensional multivariate normal vector with a 
non-singular, in general non-diagonal correlation matrix and assuming that the predictand 
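and the variable vector are jointly normal, the expected value of the error, $\varepsilon$, is given by 

$$E(\varepsilon) = \begin{cases} \rho^2 \left(\dfrac{m+1}{m}\right)\left(1 + \dfrac{N}{m-N-2}\right) & \text{for } N < m-2 \\ \infty & \text{for } N = m-2 \\ \text{undefined} & \text{for } N \ge m-1, \end{cases}$$
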

where m is the number of samples available to estimate the predictor function, and $\rho^2$ is the 
minimum mean square error achievable when all the parameters are known. The ideal error 
$\rho^2$ is in general a monotonically non-increasing function of the dimensionality, N, such 
that if the variables satisfy the conditions of (4), mentioned in the previous section, then 

$$\lim_{N \to \infty} \rho^2 = 0.$$

For $N < m-2$, there is an optimum value of N at which $E(\varepsilon)$ is a minimum. The value of this 
optimum dimensionality depends on the particular form taken by $\rho^2(N)$. For instance, 
if $\rho^2(N) = 1/(1+0.03N)$ and $m = 200$, 82 variables provide the minimum expected error. 
Increasing the number of variables above this value only decreases the performance of the 
predictor. 
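
The quoted optimum can be reproduced by direct search. The sketch below assumes the expected error has the form $\rho^2 \frac{m+1}{m}\left(1 + \frac{N}{m-N-2}\right)$ for $N < m-2$, with $\rho^2(N) = 1/(1+0.03N)$ as in the example; the function names are hypothetical.

```python
import math


def expected_error(n_vars, m, rho_sq):
    """Expected prediction error for n_vars variables estimated from m samples."""
    if n_vars < m - 2:
        return rho_sq(n_vars) * (m + 1) / m * (1 + n_vars / (m - n_vars - 2))
    if n_vars == m - 2:
        return math.inf
    raise ValueError("expected error is undefined for N >= m - 1")


def optimum_dimensionality(m, rho_sq):
    """Number of variables that minimizes the expected error, found by search."""
    return min(range(0, m - 2), key=lambda n: expected_error(n, m, rho_sq))


def rho_sq_example(n_vars):
    """Ideal error of the example in the text."""
    return 1.0 / (1.0 + 0.03 * n_vars)


print(optimum_dimensionality(200, rho_sq_example))  # 82
```

The search agrees with the closed-form stationary point $(m-2)/2 - 1/(2a) \approx 82.3$ for $a = 0.03$ and $m = 200$.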


V. A TRADE-OFF PROBLEM 


We now formulate a trade-off between the number of samples and the number of vari- 
ables. In many situations a cost is associated with acquiring one more sample or one more 
block of r samples. A cost is also associated with mechanizing sensor systems to measure one 
more variable. The motivation for considering the trade-off is that even when it is certain 
that increasing the number of variables is going to result in improved performance, it may be 
more cost-effective to achieve the same improvement by increasing the number of samples. 

For illustration, we consider the prediction problem of (11) and we perform the following 
analysis. For simplicity let us ignore the term $(m+1)/m$, which enters into the expression for 
the expected error, so that 

$$\varepsilon(N, m) = \rho^2 \left[\frac{m-2}{m-N-2}\right] \quad \text{for } N < m-2,$$

where m is the number of samples and N the number of variables. Let $C_m$ be the cost associated 
with obtaining one more sample and $C_N$ the cost to mechanize the system to measure 
one more variable. Assume that $\rho^2$, the ideal error, decreases as a function of N in the 
manner $\rho^2 = 1/(1+aN)$, so that the N for which the minimum error is attained is 

$$N_{opt} = \frac{m-2}{2} - \frac{1}{2a}.$$

Reference (11) arrived at this specific function for $\rho^2$ as an empirical match. While the 
specific numerical results of our analysis in this section might change with a different 
function for $\rho^2$, the qualitative aspects remain similar. 
The question we consider then is: what are the optimum numbers of samples and variables 
which achieve a given error rate and minimize the cost? Since the cost is $C = C_m \cdot m + C_N \cdot N$, 
letting the error rate $\varepsilon = E$, we can minimize the cost under this constraint. 
Substituting for $m$ and $\rho^2$ from the expressions given above, C can be written as 

$$C = C_m\left[\frac{(N+2)\,E(1+aN) - 2}{E(1+aN) - 1}\right] + C_N N.$$


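Setting $dC/dN$ equal to zero gives the optimum number of variables corresponding to the 
error rate E: 

$$N_{opt} = \frac{1}{aE}\left[(1-E) + \sqrt{\frac{(1-E)\,C_m}{C_m + C_N}}\right].$$

Writing Q for the bracketed quantity, this is 

$$aE\,N_{opt} = (1-E) + \sqrt{\frac{1-E}{1 + C_N/C_m}} = Q.$$

For $C_N/C_m \to 0$, $Q \to (1-E) + \sqrt{1-E} = \sqrt{1-E}\,(\sqrt{1-E}+1) \to 2$ when $E \ll 1$; 
for $C_N/C_m \to \infty$, $Q \to (1-E) \to 1$. 
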
Substituting $N_{opt} = Q/aE$ into the expression for $\varepsilon$ gives 

$$m_{opt} - 2 = \frac{Q}{aE}\left[\frac{Q+E}{Q-(1-E)}\right].$$

Thus, for $C_N/C_m \to 0$, $m_{opt} - 2 \to 4/aE = 2N_{opt}$; for $C_N/C_m \to \infty$, $m_{opt} \to \infty$. 

These results prescribe, simply, that when variables are cheap relative to samples, having 
specified the optimum number of variables corresponding to the error rate E, we need only 
use twice that many samples; when variables are very expensive compared to samples we 
pick only half as many variables, viz. $1/aE$, and use a very large number of samples. 
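
These two limiting prescriptions can be checked by brute force. The sketch below assumes the error model $\varepsilon = \rho^2 (m-2)/(m-N-2)$ with $\rho^2 = 1/(1+aN)$, solves the constraint $\varepsilon = E$ for m, and searches over integer N for the cheapest design. The values a = 0.03 and E = 0.05, the cost figures and the function names are hypothetical, and m is treated as a real number.

```python
def samples_needed(n_vars, target_error, a):
    """Sample size m that gives error E with n_vars variables, or None if impossible."""
    u = target_error * (1.0 + a * n_vars) - 1.0
    if u <= 0.0:
        return None  # the ideal error with this many variables already exceeds E
    return 2.0 + n_vars * (u + 1.0) / u


def cheapest_design(target_error, a, cost_sample, cost_variable, n_max=5000):
    """Return (cost, N, m) minimizing C = C_m * m + C_N * N subject to error = E."""
    best = None
    for n_vars in range(1, n_max + 1):
        m = samples_needed(n_vars, target_error, a)
        if m is None:
            continue
        cost = cost_sample * m + cost_variable * n_vars
        if best is None or cost < best[0]:
            best = (cost, n_vars, m)
    return best


a, target = 0.03, 0.05
_, n_cheap, m_cheap = cheapest_design(target, a, cost_sample=1.0, cost_variable=0.001)
_, n_dear, m_dear = cheapest_design(target, a, cost_sample=1.0, cost_variable=1.0e6)
print(n_cheap, (m_cheap - 2.0) / n_cheap)  # ratio close to 2
print(n_dear, a * target * n_dear)         # product close to 1
```

When variables are nearly free the search settles on roughly twice as many samples as variables; when they are very costly it settles near $N = 1/aE$ with a very large sample.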
VI. PROBABILITY STRUCTURE AND SAMPLE SIZE 


The optimum dimensionality corresponding to a given sample size depends on the 
probability structure assumed for the problem and the correspondence between the assumed 
structure and reality. Further, optimal decision procedures differ for different underlying 
probability structures. For instance, decision procedures involving estimated covariance 
matrices when nothing is known about the dependence or independence of the variables, 
are nonoptimal when in fact the variables are independent. 

The optimum dimensionality for a given sample size increases, in general, with the 
increased structure in the problem formulation, with a corresponding increase in the 
maximum probability of correct recognition. For example with a finite sample size and N 
binary variables, assuming no structure at all the problem is equivalent to the model in (8) 
where the measurement complexity is $2^N$; with 500 samples from each class, the optimum 
number of variables is 5 and the optimal probability of correct recognition is 0.74. In 
contrast consider the problem generated by increasing the structure considerably by imposing 
the following constraints (ABEND(12), p. 25). Let $X = (X_1, \ldots, X_N)$ and $Y = (Y_1, \ldots, Y_N)$ 
denote, respectively, the pattern vectors from class 1 and class 2, where the $X_i$'s and $Y_i$'s 
are independent, binary random variables with $P[X_i = 1] = \gamma$ and $P[Y_i = 1] = \beta = (1-\gamma)$ 
for all i. Let $\gamma$ be estimated using m samples from each class. If sample $Z = (Z_1, \ldots, Z_N)$ of 
unknown origin, to be classified into one of the two classes, contains r ones and $N-r$ 
zeros, the Bayes decision rule is: decide class 1 if $r < N/2$ and $\hat{\gamma} < 1/2$, or if $r > N/2$ and 
$\hat{\gamma} > 1/2$; and decide class 2 if $r < N/2$ and $\hat{\gamma} > 1/2$, or if $r > N/2$ and $\hat{\gamma} < 1/2$. The probability 
of error is derived in (12). Going one step beyond (12) we arrive at the result of interest here. 
It can be verified that $\lim_{N \to \infty} P_e = 0$ for finite m and $\lim_{m \to \infty} P_e \ne 0$ for finite N. In this 
example, for any finite sample size, arbitrarily increasing the number of variables 
always improves performance. Thus the optimal measurement complexity for this highly 
structured problem is infinite. The case of non-identically distributed independent variables 
lies in between those of (8) and (12). We refer the reader to a recent publication, 
CHANDRASEKARAN(17), for some interesting results for this problem. 
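
The behaviour of this highly structured example can be illustrated by simulation. The sketch below makes several assumptions not fixed by the text: $\gamma$ is estimated by pooling the ones of the class 1 design samples with the zeros of the class 2 design samples, a tie $\hat{\gamma} = 1/2$ is treated like $\hat{\gamma} < 1/2$, N is kept odd so that $r = N/2$ cannot occur, and the classes are equally likely. The parameter values and function name are hypothetical.

```python
import random


def simulate_error(n_vars, m, gamma, trials, seed=0):
    """Monte Carlo error rate of the majority-count rule with estimated gamma."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        # Design phase: m samples of n_vars binary variables from each class.
        ones_class1 = sum(rng.random() < gamma for _ in range(m * n_vars))
        ones_class2 = sum(rng.random() < 1.0 - gamma for _ in range(m * n_vars))
        # Assumed pooled estimate: a zero in class 2 is evidence for gamma as well.
        gamma_hat = (ones_class1 + m * n_vars - ones_class2) / (2 * m * n_vars)
        # Test phase: one new sample from a randomly chosen class.
        true_class = rng.choice((1, 2))
        p_one = gamma if true_class == 1 else 1.0 - gamma
        r = sum(rng.random() < p_one for _ in range(n_vars))
        # Decide class 1 when the majority of ones agrees with the estimated gamma.
        decided = 1 if (r > n_vars / 2) == (gamma_hat > 0.5) else 2
        errors += decided != true_class
    return errors / trials


for n_vars in (5, 25, 101):
    print(n_vars, simulate_error(n_vars, m=2, gamma=0.6, trials=2000))
```

Even with only two design samples per class, the simulated error keeps falling as variables are added, in contrast to the unstructured model.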

Another aspect of the relationship between probability structure, sample size and 
measurement-space dimensionality is brought out rather clearly in any decision procedure 
involving the estimation of covariance matrices. Let a sample of size m be available from an 
N-variate population. Regardless of the actual population covariance matrix, if $N > m-1$, 
the estimated covariance matrix is singular. If samples from both classes are pooled to 
estimate a single covariance matrix, then with m now representing the total of samples 
from classes 1 and 2, the estimated covariance matrix is singular for $N > m-2$. Traditionally 
this point was avoided by restricting consideration to those cases where the number of 
variables is less than $m-2$, although HOTELLING(13) explicitly mentions an example with 
four samples and five observables. In (13) HOTELLING says, "some of the information must 
be allowed to go to waste" and discards three of the five random variables. He admits his 
dissatisfaction with this procedure and with the alternate approach of removing the singularity 
problem by assuming independent variables. A rationale against reducing the number 
of variables to get around the singularity problem in the small sample case is presented in 
KANAL.(14) When the estimated covariance matrix is of rank $r < N$, then all the sample 
points lie on an r dimensional hyperplane. The suggestion by many authors of using the 
Generalized Inverse, which exists even for singular matrices, is undesirable as it constrains 
the estimated population to the r dimensional hyperplane of sample points. HARLEY(15) 
presents a class of "pseudo estimates" for the covariance matrix, obtained by adding a term 
proportional to the average variance to each diagonal element of the sample estimate of the 
covariance matrix. It also contains examples showing the superiority of the suggested 
estimates over the Generalized Inverse solution. The Bayesian version and justification 
for the ad hoc approach of (15) appear in ANDO.(16) We note that from a Bayesian point of 
view the Generalized Inverse procedure is untenable since it implies a prior distribution 
which assigns zero probability to some part of the space of variates. 
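
The singularity and its removal by a diagonal term can be seen in a two-variable sketch. The proportionality constant c = 0.1, the data and the function names are hypothetical; the text does not specify how (15) chooses the constant.

```python
def sample_covariance(points):
    """Unbiased sample covariance matrix of a list of equal-length tuples."""
    m, n = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / m for k in range(n)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (m - 1)
             for j in range(n)] for i in range(n)]


def pseudo_estimate(cov, c):
    """Add c times the average variance to each diagonal element of cov."""
    n = len(cov)
    average_variance = sum(cov[i][i] for i in range(n)) / n
    return [[cov[i][j] + (c * average_variance if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def det_2x2(a):
    """Determinant of a 2 x 2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


# m = 2 samples of N = 2 variables, so N > m - 1 and the estimate is singular.
points = [(1.0, 2.0), (3.0, 5.0)]
cov = sample_covariance(points)
regularized = pseudo_estimate(cov, c=0.1)
print(det_2x2(cov), det_2x2(regularized))
```

The raw estimate has zero determinant because both samples lie on a line, while the modified estimate is invertible and spreads probability off that line.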

Consideration of the structural assumptions of the model of (8) leads to an explanation 
of the rather pessimistic results for measurement complexity. First, there is no provision in 
the model for the case of independent variables and consequently for the use of more 
variables than allowed by the general model. The mean performance is calculated as an 
average over all problems with various degrees of correlations between variables. This 
tends to lower the average performance from the value corresponding to the case of inde- 
pendent variables. However, in the absence of any knowledge of independence it is not 
unreasonable to obtain average performance over all correlations that are possible. 

Another important assumption responsible for the conservative results on measurement 
complexity is that of lack of continuity in probability values of neighboring measurement 
cells. To illustrate this consider just two variables $Y_1$ and $Y_2$, each taking values from 1 to 10, 
with increments of 1. Let us index the 100 measurement states as follows. States $S_1$ to $S_{10}$ 
represent states for which $(Y_1, Y_2) = (i, 1)$ where i goes from 1 to 10; states $S_{11}$ to $S_{20}$ 
represent $(i, 2)$, $1 \le i \le 10$, and so on. If we construct three dimensional probability diagrams where 
one horizontal axis represents the ten values of $Y_1$, a second horizontal axis the ten values of $Y_2$, 
and the vertical axis the probabilities associated with these values, we would normally not 
expect highly discontinuous changes in these probabilities. Reference (8) gives an example 
where the estimated probabilities in a real problem are indeed discontinuous. However, 
there are very many other problems where this does not apply, and in these cases the useful- 
ness of the conservative results for maximum measurement complexity is questionable. 

A further examination of this assumption provides meaning to these numbers and also 
an insight for tightening the design procedure. Continuity provides redundancy and 
redundancy is generally helpful in reducing error. However, if estimates of the nature of this 
redundancy are unreliable, performance can worsen rather than improve. When we do not 
have enough information for estimating redundancy, it can be removed by quantizing the 
variables no finer than is absolutely necessary. That is, we provide just enough levels so that 
knowledge of $P[S_i \mid \alpha]$ does not increase our knowledge of $P[S_j \mid \alpha]$ where $j \ne i$. Another way 
to reduce the measurement complexity, when it is necessary to improve average performance, 
is to eliminate relatively insignificant variables. Various suggestions for the selection and 
evaluation of features have been made, and we expect to present an examination of these 
suggestions elsewhere. 


VII. CONCLUSIONS 


On the question of the optimum use of a fixed size sample to maximize overall confidence 
in the design and testing of pattern classification systems, the only published theoretical 
analysis of the simple partitioning of samples between design and test sets, and the recom- 
mendations based on it, are invalid in the small sample case. In this case a definitely superior 
alternate method is available. 
On the question of how many variables to use in designing a classification system, the 
probability structure assumed and its correspondence to the real problem at hand determine 
the relationship between the design sample size, the number of variables and the probability 
of error. Earlier investigations led to very optimistic results while recent results have been 
very pessimistic. As we have shown, these are only extremes of a range of possibilities. 